Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming
Author
Abstract
This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned model of the world. In this paper, I present and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system that simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.

Introduction to Dyna

How should a robot decide what to do? The traditional answer in AI has been that it should deduce its best action in light of its current goals and world model, i.e., that it should plan. However, it is now widely recognized that planning's usefulness is limited by its computational complexity and by its dependence on an accurate world model. An alternative approach is to do the planning in advance and compile its result into a set of rapid reactions, or situation-action rules, which are then used for real-time decision making. Yet a third approach is to learn a good set of reactions by trial and error; this has the advantage of eliminating the dependence on a world model. In this paper, I briefly introduce Dyna, a class of simple architectures integrating and permitting tradeoffs among these three approaches.

Dyna architectures use machine learning algorithms to approximate the conventional optimal control technique known as dynamic programming (DP) (Bellman; Ross). DP itself is not a learning method, but rather a computational method for determining optimal behavior given a complete model of the task to be solved. It is very similar to state-space search, but differs in that it is more incremental and never considers actual action sequences explicitly, only single actions at a time. This makes DP more amenable to incremental planning at execution time, and also makes it more suitable for stochastic or incompletely modeled environments, as it need not consider the extremely large number of sequences possible in an uncertain environment. Learned world models are likely to be stochastic and uncertain, making DP approaches particularly promising for learning systems. Dyna architectures are those that learn a world model online while using approximations to DP to learn and plan optimal behavior.

Intuitively, Dyna is based on the old idea that planning is like trial-and-error learning from hypothetical experience (Craik; Dennett). The theory of Dyna is based on the theory of DP (e.g., Ross), on DP's relationship to reinforcement learning (Watkins; Barto, Sutton & Watkins), to temporal-difference learning (Sutton), and to AI methods for planning and search (Korf). Werbos has previously argued for the general idea of building AI systems that approximate dynamic programming, and Whitehead and others (Sutton & Barto; Sutton & Pinette; Rumelhart et al.) have presented results for the specific idea of augmenting a reinforcement learning system with a world model used for planning.

Dyna-PI: Dyna by Approximating Policy Iteration

I call the first Dyna architecture Dyna-PI because it is based on approximating a DP method known as policy iteration (Howard). The Dyna-PI architecture consists of four components interacting as shown in the figure. The policy is simply the function formed by the current set of reactions; it receives as input a description of the current state of the world and produces as output an action to be sent to the world. The world represents the task to be solved; prototypically, it is the robot's external environment. The world receives actions from the policy and produces a next-state output and a reward output. The overall task is defined as maximizing the long-term average reward per time step (cf. Russell). The architecture also includes an explicit world model. The world model is intended to mimic the one-step input-output behavior of the real world. Finally, the Dyna-PI architecture includes an evaluation function that rapidly maps states to values, much as the policy rapidly maps states to actions. The evaluation function, the policy, and the world model are each updated by separate learning processes.

[Figure: the Dyna-PI architecture — the Policy, World, World Model, and Evaluation Function, connected by Action, State, Reward (scalar), and Heuristic Reward (scalar) signals.]
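To make the "operating alternately on the world and on a learned model of the world" idea concrete, here is a minimal sketch of a tabular Dyna-Q-style loop on a toy grid-navigation task. The grid, its size, and all names (`world_step`, `dyna_q`, `n_planning`, and so on) are my own illustration, not the paper's experimental setup; the structure — a Q-learning update from real experience, a learned one-step model, and the same update replayed on hypothetical transitions drawn from that model — is the Dyna-Q pattern the text describes.

```python
import random
from collections import defaultdict

random.seed(0)

# A tiny deterministic grid world standing in for the paper's navigation
# task: a 3x4 grid, start at the bottom-left, reward 1 on reaching the
# top-right goal state.
ROWS, COLS = 3, 4
START, GOAL = (2, 0), (0, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def world_step(state, action):
    """One-step dynamics of the real world: move if in bounds."""
    r = min(max(state[0] + action[0], 0), ROWS - 1)
    c = min(max(state[1] + action[1], 0), COLS - 1)
    nxt = (r, c)
    return (1.0 if nxt == GOAL else 0.0), nxt

def dyna_q(episodes=50, n_planning=20, alpha=0.5, gamma=0.95, eps=0.1):
    Q = defaultdict(float)  # (state, action) -> action-value estimate
    model = {}              # learned one-step world model: (s, a) -> (r, s')
    for _ in range(episodes):
        s = START
        while s != GOAL:
            # react: epsilon-greedy choice from the current value estimates
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            r, s2 = world_step(s, a)
            # learn: Q-learning update from real experience
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model: record the observed one-step transition
            model[(s, a)] = (r, s2)
            # plan: the same update applied to hypothetical experience
            # drawn from the learned model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in ACTIONS)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```

After training, acting greedily with respect to the learned `Q` walks from the start to the goal; the planning steps are what let a single real success propagate quickly through the value estimates, which is why Dyna-style systems need far less real experience than model-free learning alone.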
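The evaluation function at the heart of Dyna-PI can be pictured with an even smaller sketch: a TD(0) update on a toy three-state chain (the chain, the constants, and the names `V`, `beta`, `delta` are my own illustration, not the paper's). The temporal-difference error `delta` plays the role of the "heuristic reward" in the figure: it updates the evaluation function and, in the full architecture, would also strengthen or weaken the policy's reactions.

```python
# A three-state chain 0 -> 1 -> 2, with reward 1.0 on entering the
# terminal state 2. V is the evaluation function, rapidly mapping
# states to values.
gamma, beta = 0.9, 0.1
V = [0.0, 0.0, 0.0]  # V[2] is terminal and stays 0

for _ in range(500):
    s = 0
    while s != 2:
        s2 = s + 1
        r = 1.0 if s2 == 2 else 0.0
        delta = r + gamma * V[s2] - V[s]  # TD error ("heuristic reward")
        V[s] += beta * delta              # evaluation-function update
        s = s2
```

The updates converge to the fixed point V(1) = 1 and V(0) = gamma * 1 = 0.9: each state's value comes to predict the discounted reward that follows it, which is exactly the information a Dyna-PI policy needs in order to improve.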
Publication date: 1990